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Abstract 

Numerically exact investigations of interacting spin systems provide a major tool 
for an understanding of their magnetic properties. For medium size systems the ap- 
proximate Lanczos diagonalization is the most common method. In this article we 
suggest two improvements: efficient basis coding in subspaces and simple restruc- 
turing for openMP parallelization. 
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1 Introduction 



Many magnetic materials can accurately be described by the Heisenberg or 
related effective spin models. Due to the vastly increasing size of the under- 
lying Hilbert space, which grows as (2s + 1)^ for N spins of spin quantum 
number s, only small spin systems can be modeled exactly, i.e. their complete 
eigenspectrum can be determined. For larger systems approximate methods 
such as the Lanczos [1] or related methods like the Arnoldi, the projection, or 
the Density Matrix Renormalization Group (DMRG) method [2f3f4"] are used. 
They usually aim at properties of ground states in orthogonal subspaces, which 
are provided by symmetry, see e.g. [5|6|7] . But also thermal properties can be 
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addressed by means of a finite-temperature Lanczos method [8] as done for 
instance for the evaluation of certain Kondo lattice models in Ref. [9]. 

For all these methods it is of course advantageous to use the present symme- 
tries in order to reduce the size of the Hamiltonian matrix as much as possible 
by decomposing the Hilbert space into mutually orthogonal subspaces. One 
obvious symmetry is the rotational invariance of many models with respect to 
rotations about the z-axis in spin space. This leads to a decomposition of the 
total Hilbert space TC into orthogonal subspaces TC(M) characterized by their 
total magnetic quantum number M. The related basis, which is a subset of the 
full basis, should then efficiently be encoded. In nowadays applications these 
basis states are either stored in tables and assessed via hash search methods, 
see e.g. [10] . or encoded using the two-dimensional representation by Lin 
which needs two vectors of size ~ (2s + l)^/ 2 ) for encoding. In this article 
we will provide direct algorithms for encoding and decoding of basis states in 
subspaces TC(M). 

Thanks to available SMP (symmetric multiprocessing) computers with large 
shared memory Lanczos vectors of considerable size can be processed. An 
example is given in Ref. [7] where Lanczos vectors with about 10 9 entries were 
used. We show that by a simple reformulation of the typical implementation 
of the Lanczos algorithm a very sufficient parallelization with openMP can be 
achieved that avoids write conflicts. 

The article is organized as follows. The next section shortly introduces the 
Heisenberg model as an archetypical example. In section [3J we introduce the 
new basis encoding in subspaces 7i{M). The last section @] deals with paral- 
lelization issues. 



2 Heisenberg Hamiltonian and basis encoding 

Spin systems are very often modeled by effective spin Hamiltonian such as the 
isotropic Heisenberg Hamiltonian 

# = ~E Juvi(u)-£(V) . (1) 

u,v 

s(u) are the individual spin operators at sites u. J uv are the matrix elements 
of the symmetric coupling matrix. In the following we will assume that all 
spin quantum numbers are equal, i.e. si = $2 = ■ ■ ■ = sn = s. 

The starting point for any diagonalization is the product basis of the single- 
particle eigenstates of all s z (u) 
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s z (u) I mi, ... , m u , ...,m N ) = m u \mi,..., m„, . . . ,m N ) . (2) 

These states are sometimes called Ising states. They span the full Hilbert 
space and are used to construct symmetry-related basis states. For encoding 
purposes, and since m u can be half-integer, they are usually rewritten in terms 
of quantum numbers a u = s — m u instead of m u , where a u = 0, 1, . . . , 2s. 
The number of basis states, i.e. the dimension of the full Hilbert space, is 
dim (H) = (2s + 1)^. The complete basis set | a±, . . . , a u , . . . , a/v ) provides 
itself a natural encoding given by the number system with basis (2s + 1). To 
give an example, the basis of a system of 8 spins s = 1 can be completely 
and easily encoded using all 8-digit numbers where each digit can assume the 
values 0, 1, 2: 



|0,0,0,0,0,0,0,0) (3) 

1 1,0,0,0,0,0,0,0) 

|2,0,0,0,0,0,0,0) 

|0,1,0,0,0,0,0,0) 

12,2,2,2,2,2,2,2) . 
3 Basis encoding in H(M) 

The basis in the subspace H(M) is given by all product states \a ly . . . ,a^) 
with M = Ns — J2 U a u- Fo r usage in a computer program they need to be 
assigned to integer numbers 1, . . . , dim (7i(M)). The reason is that one usu- 
ally does not need the basis only once at initialization, but at every Lanczos 
iteration, since the sparse Hamiltonian matrix is not stored, but its non-zero 
matrix elements are evaluated whenever needed using 

( i I H \j ) = ( ai, . . . , < | H | a{, . . . , a j N ) . (4) 

For a direct coding algorithm of basis states in subspaces TC(M) it is advan- 
tageous that the the sizes of the subspaces H(M) are known analytically [12]. 
Thus an array can be built at startup that contains for a fixed s the sizes of 
these subspaces 7i(M = Ns — A) for given iV and A. We will call this array 
D(N,A). It will be used to determine the sequential number of a basis vec- 
tor in Ti,(M). The recursive buildup is performed using the following relation 
between the sizes of subspaces 



D(N,A) = jr D{N-l,A-k) , (5) 

k=0 
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with D(N = 1, A = 0,1, ... ,2s) = 1, D(N, A = 0) = 1, and D(N, A = 1) = 
N. If A £ {0, 1, ... , 2Ns} then D(N, A) = 0. 



3.1 i=> |a|,...,aV) 

One coding direction, i =>■ | aj, . . . , a % N }, which is the more trivial direction, 
can be realized in several ways. If the basis is not too big one simply gener- 
ates all basis states of the subspace TC(M) in lexicographical order, compare 
(13]), and stores the quantum numbers a\ of the ith vector in an array. The 
generation can either be performed by running through all basis states ([3]) 
and sorting out those which comply with the condition M = Ns — J2u a u or 
by algorithms that generate only those basis states that obey the condition 
already. 

A direct algorithm i =>- \ a\, . . . ,a % N ) using the known dimensions of the sub- 
spaces TC(M = Ns — A) could be realized as followg3 

m=0 

Ak = A 
do k=N,2,-l 
do n=0,2*s 
if (i.le. (m+D(k-l,Ak-n+l))) then 
BasisVector (k) = n 
Ak = Ak - n 
goto 100 
else 

m = m + D(k-l,Ak-n+l) 
endif 
enddo 
100 continue 
enddo 

BasisVector(l) = Ak 

BasisVector contains the N entries a&. This algorithm will be made clearer 
when we explain the inverse algorithm in subsection 13. 21 

Nevertheless, since a Lanczos routine would run through a state vector along 
the lexicographical order of basis states one would actually only need a func- 
tion that generates for a given basis state the succeeding basis state. To un- 



The given code uses FORTRAN notation. Nevertheless, it can be easily trans- 
formed into C. One should only pay attention to the fact that field indices in FOR- 
TRAN start at 1 not at 0. Therefore, the definition of the second field index of D 
has been modified accordingly. 
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derstand how this works it is helpful to picture the basis states | a\, . . . , a l N ) 
as distributions of exactly A = J2u a u balls in N boxes, where each box can 
contain at most 2s balls. Thus the lexicographically lowest state is given by the 
distribution where the boxes are filled sequentially starting with the leftmost 
box, i.e. entry number 1. 

How does one advance from one basis state to the succeeding one? 

(1) Find the leftmost position k for which the entry is nonzero and the next 
entry is less than 2s. If such a position does not exist, then there is no 
succeeding basis state. 

(2) Take one (ball) out of entry (box) k and add it to the next entry to the 
right, i.e. entry (box) with index k + 1. 

(3) Empty all entries (boxes) 1 to A; and fill this content (these balls) into 
the entries (boxes) starting from the left in lexicographical order. 

Take as an example for iV = 8, s = 3/2, and A = 6 the state | 0, 0, 0, 2, 3, 1, 0, 0). 
Entry number k = 5 from the left is the first position to fulfill the first con- 
dition. One out of the 3 is put into k = 6 yielding 2 there. Then the content 
of entries k = 1, . . . , 5 is taken and filled into the entries starting from the 
left. This content is 4 in the present example. Three out of the four can be 
filled into entry number 1. The rest fits into entry number 2. Therefore, the 
resulting basis state is | 3, 1, 0, 0, 0, 2, 0, 0). 

3.2 \a\,...,a%) i 

The inverse direction is actually the nontrivial one, since the basis vectors 
are only a subset of the full basis set (J3]). Therefore, for the latter coding 
direction search algorithms are employed, e.g. [10], or the two-dimensional 
representation of Lin [11] is used, which needs two vectors of size ~ (2s+l)^ N ^ 2 ^ 
to encode all basis states. 

The position of a basis vector | a 1; . . . , ) in the lexicographically ordered 
list of vectors will be determined by evaluating how many vectors lay before 
this vector. For this purpose the known dimensions of the subspaces 7i{M = 
Ns—A) are used again. We explain this procedure with an instructive example. 
Assume we investigate a spin system with N = 4 and s = 3/2 in a subspace 
of A = 6, i.e. M = 0. Our example basis vector is | 1,0,2,3). In the list of 
basis vectors all vectors fulfilling one of the following criteria are listed before 
the example vector, the respective dimensions will be added: 

• Vectors with 0, 1, or 2 instead of 3 as the first (rightmost) figure: Their 
dimensions are D(3, 6), -D(3, 5), and -D(3, 4), respectively, since the condition 
that J2i a i — A must be fulfilled in total. 
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• Out of all vectors where the first figure is 3, those where the second figure 
is or 1 are listed before, thus their respective dimensions of D(2, 3) and 
D(2, 2) must be added. 

• This procedure continues until the last figure. In the present example this 
yields for the third figure and simply for the last figure. 

• Thus the number of the present vector is given by the sum of the mentioned 
dimensions plus one. 

In a computer program one can evaluate the position i of | a\, . . . , ajv } in the 
list of basis vectors according to 

Ak = A 
i = 1 

do k=N,2,-l 

do n=0,BasisVector(k)-l 
i = i + D(k-l,Ak-n+l) 
enddo 

Ak = Ak - BasisVector(k) 
enddo 

BasisVector contains the N entries a^. If the array of dimension D(N, A) is 
properly initialized, i.e. the field value is zero for non-valid combinations of N 
and A, then the sum can be performed in a computer program without paying 
attention to the restrictions for the indices. 



4 Parallel Lanczos implementation on SMP machines 

Parallelization of the Lanczos or similar methods aims at a parallelization of 
the basic matrix- vector operations. This has been reported as being extremely 
difficult due to prohibitive communication costs [T3"|IT4] . In this section we show 
that parallelization is possible if (1) the sparse matrix is not stored but matrix 
elements are evaluated whenever needed and (2) the loops for matrix-vector 
multiplication are rearranged. 

The basic step of a Lanczos or a similar method consists in the (repeated) 
application of the sparse matrix, i.e. the Hamiltonian, onto an initial trial 
vector 

dim(W(M)) 

(i|V>2>= E (i\H\j)(j\i>i) ■ (6) 

Here are the entries of the initial column vector the resulting 

vector is | ip2 )■ 
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(1) Although the Hamiltonian matrix ( i \ H \ j ) is sparse, it typically contains 
an order of iV 1 "' 2 x dim (7i(M)) non-zero entries, for instance for Heisenberg 
systems. For very large dimensions, e.g. of order 10 9 , this would easily amount 
to several dozens of Gigabytes. Therefore, it would be better not to store the 
matrix, but to evaluate the matrix elements whenever needed. 

(2) A typical implementation would have the loop about j as the outermost 
loop. An entry of the initial vector would be read, then the non-zero 
matrix elements (i\H\j) would be determined, and the resulting products 
would be written into the respective entries ( i \ ip2 ) ■ When parallelizing the 
loop about j this leads to write conflicts since different initial entries may 
result in the same final one. 

It turns out that both problems can be solved together in cases where the 
application of the Hamiltonian onto each basis state is known analytically. 
In these cases only the non-vanishing matrix elements will be generated by 
applying the Hamiltonian, e.g. ([1]), onto the final basis state | i). This yields 
for a given final index i a set of initial indices where only these indices 

contribute in the sum in Eq. (Q. 



(i\fo)= £ (i\H\j)(j\^) . (7) 
{?'(»*)} 

Therefore, one would rewrite Eq. (jH]) as Eq. (J7]) and in a parallel computer 
program let i be the outer loop. Then one determines for every final entry 
( % | ip2 ) those initial entries that contribute with non-zero ( % \ H \ j ) 

in the sum ([7]). It may happen that at runtime different threads read the same 
entry of the initial vector, but this is harmless. 
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Fig. 1. Scaling of CPU time for 200 Lanczos iterations with number of allowed 
threads. The machine has eight cores. 

Figure [1] shows as an example the scaling of CPU time for 200 Lanczos it- 
erations with a vector of length 484, 500. The program and all subroutines 
are written in Fortran and compiled with the INTEL Fortran compiler using 
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openMP directives. The linear scaling is almost perfect. Slight deviations are 
due the non-parallel parts of the program, especially the initialization. 

Summarizing, in this article we provide a coding algorithm for spin basis states 
in subspaces Tt(M) and demonstrate that a rearrangement of loops allows an 
efficient parallelization of the Lanczos algorithm. The proposed improvements 
can easily be ported to similar methods such as Arnoldi or projection method. 
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